Considering Bayesian alternative fits with weakly informative priors
What must we do differently (Bayes vs. OLS)?
Assessing the candidate models more thoroughly, in both the training and test samples
MAPE, RMSPE, Maximum Prediction Error, Validated \(R^2\)
Incorporating multiple imputation in building a final model
431 strategy: “most useful” model?
Split the data into a development (model training) sample of about 70-80% of the observations, and a holdout (model test) sample, containing the remaining observations.
Develop candidate models using the development sample.
Assess the quality of fit for candidate models within the development sample.
Check adherence to regression assumptions in the development sample.
431 strategy: “most useful” model?
When you have candidates, assess them based on the accuracy of the predictions they make for the held-out data (which were thus not used in building the models).
Select a “final” model for use based on the evidence in steps 3, 4 and especially 5.
R Packages and Data Load
```r
knitr::opts_chunk$set(comment = NA)

library(janitor)
library(mice)
library(naniar)
library(patchwork)
library(car)       ## for vif function as well as Box-Cox
library(GGally)    ## for ggpairs scatterplot matrix
library(broom)     ## for predictions, residuals with augment
library(rstanarm)  ## fitting Bayesian regressions
library(gt)        ## some prettier tables
library(easystats)
library(tidyverse)

source("c21/data/Love-431.R")
theme_set(theme_bw())

dm500 <- readRDS("c21/data/dm500.Rds")
```
What We’ve Done So Far
Imputation and Partitioning
```r
set.seed(20241031)
dm500_tenimps <- mice(dm500, m = 10, printFlag = FALSE)
dm500_i <- complete(dm500_tenimps, 7) |> tibble()

set.seed(4312024)
dm500_i_train <- dm500_i |> slice_sample(prop = 0.7, replace = FALSE)
dm500_i_test  <- anti_join(dm500_i, dm500_i_train, by = "subject")
```
Fixing My Mistake from Classes 19-20
Three Regression Models We’ve Fit
Using the model training sample, and a (100/a1c) outcome transformation.
My mistake: once you decide on a transformation, create the transformed variable in your data before fitting.
```r
dm500_i_train <- dm500_i_train |> mutate(transa1c = 100/a1c)

fit1 <- lm(transa1c ~ a1c_old, data = dm500_i_train)
fit2 <- lm(transa1c ~ a1c_old + age, data = dm500_i_train)
fit3 <- lm(transa1c ~ a1c_old + age + income, data = dm500_i_train)
```
Performance Indices for 3 Models
in the training sample
```r
plot(compare_performance(fit1, fit2, fit3))
```
OLS Model fit1 Checking
```r
check_model(fit1)
```
OLS Model fit2 Checking
```r
check_model(fit2)
```
OLS Model fit3 Checking
```r
check_model(fit3)
```
augment training samples
```r
aug1 <- augment(fit1, data = dm500_i_train)
aug2 <- augment(fit2, data = dm500_i_train)
aug3 <- augment(fit3, data = dm500_i_train)
```
augment results for fit2 in our first six subjects…
```r
aug2 |> head() |> gt()
```

| a1c | a1c_old | age | income | subject | transa1c | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|----:|--------:|----:|:-------|:--------|---------:|--------:|-------:|-----:|-------:|--------:|-----------:|
| 11.6 | 5.6 | 54 | Below_30K | S-168 | 8.62069 | 15.38293 | -6.7622361 | 0.007351110 | 2.276248 | 2.145934e-02 | -2.9484254 |
| 6.5 | 6.4 | 68 | Higher_than_50K | S-359 | 15.38462 | 14.96010 | 0.4245154 | 0.008948617 | 2.305194 | 1.032820e-04 | 0.1852435 |
| 7.6 | 7.0 | 60 | Between_30-50K | S-421 | 13.15789 | 14.18858 | -1.0306848 | 0.003869153 | 2.304640 | 2.605607e-04 | -0.4486063 |
| 7.3 | 8.0 | 56 | Between_30-50K | S-037 | 13.69863 | 13.13196 | 0.5666705 | 0.002935108 | 2.305107 | 5.963655e-05 | 0.2465282 |
| 7.7 | 7.8 | 63 | Between_30-50K | S-030 | 12.98701 | 13.49557 | -0.5085564 | 0.004906357 | 2.305146 | 8.060899e-05 | -0.2214648 |
| 6.4 | 7.0 | 34 | Below_30K | S-055 | 15.62500 | 13.54996 | 2.0750381 | 0.020785932 | 2.302550 | 5.871383e-03 | 0.9109298 |
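augment() can produce the same quantities for the holdout sample via its newdata argument. A sketch (note that the transformed outcome must be created in the test sample as well):

```r
# Add the transformed outcome to the test sample, then predict with fit2.
dm500_i_test <- dm500_i_test |> mutate(transa1c = 100/a1c)

aug2_test <- augment(fit2, newdata = dm500_i_test)

# Holdout predictions (.fitted) and errors (.resid) on the transformed scale:
aug2_test |> select(subject, transa1c, .fitted, .resid) |> head()
```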
Bayesian fits instead?
Refit with Bayesian models?
What must we change to use Bayesian (stan_glm) fits?
Must create the transformed outcome in the data prior to fitting with stan_glm() from the rstanarm package.
Results are a bit different for model_parameters()
Results are a bit different for model_performance()
No real change for check_model().
There is no augment() method for rstanarm fits.
Refit with Bayesian models?
with default weakly informative priors
```r
set.seed(20241112)
fit1B <- stan_glm(transa1c ~ a1c_old, data = dm500_i_train, refresh = 0)
fit2B <- stan_glm(transa1c ~ a1c_old + age, data = dm500_i_train, refresh = 0)
fit3B <- stan_glm(transa1c ~ a1c_old + age + income, data = dm500_i_train, refresh = 0)
```
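rstanarm rescales its default priors to the data, and prior_summary() shows exactly what was used. As a sketch, we could also set priors explicitly (the values below are illustrative, not a recommendation):

```r
# Inspect the autoscaled weakly informative defaults used in fit1B:
prior_summary(fit1B)

# Or specify priors ourselves (illustrative values only):
fit1B_alt <- stan_glm(transa1c ~ a1c_old, data = dm500_i_train,
                      prior = normal(0, 2.5, autoscale = TRUE),
                      prior_intercept = normal(13, 7),
                      refresh = 0)
```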
Estimating fit1 Coefficients?
Bayesian fit
```r
model_parameters(fit1B, ci = 0.95) |> gt()
```

| Parameter | Median | CI | CI_low | CI_high | pd | Rhat | ESS | Prior_Distribution | Prior_Location | Prior_Scale |
|:----------|-------:|---:|-------:|--------:|---:|-----:|----:|:-------------------|---------------:|------------:|
| (Intercept) | 20.97661 | 0.95 | 19.910242 | 22.0855930 | 1 | 1.000731 | 4048.198 | normal | 13.36034 | 7.270914 |
| a1c_old | -0.98244 | 0.95 | -1.118648 | -0.8463795 | 1 | 1.000882 | 3928.896 | normal | 0.00000 | 4.028203 |
OLS fit
```r
model_parameters(fit1, ci = 0.95) |> gt() |> fmt_number(decimals = 3)
```

| Parameter | Coefficient | SE | CI | CI_low | CI_high | t | df_error | p |
|:----------|------------:|---:|---:|-------:|--------:|--:|---------:|--:|
| (Intercept) | 20.968 | 0.545 | 0.950 | 19.896 | 22.039 | 38.490 | 348.000 | 0.000 |
| a1c_old | −0.982 | 0.068 | 0.950 | −1.117 | −0.847 | −14.338 | 348.000 | 0.000 |
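Although augment() has no method for stan_glm() fits, we can build fitted values and residuals by hand from the posterior; a sketch for fit1B:

```r
# Posterior predictive draws: one column per training observation.
pp1B <- posterior_predict(fit1B)

# Point predictions (posterior means) and residuals on the transformed scale:
aug1B <- dm500_i_train |>
  mutate(.fitted = colMeans(pp1B),
         .resid  = transa1c - .fitted)

aug1B |> select(subject, transa1c, .fitted, .resid) |> head()
```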
Estimating fit2 Coefficients?
```r
model_parameters(fit2B, ci = 0.95) |> gt() |> fmt_number(decimals = 3)
```

| Parameter | Median | CI | CI_low | CI_high | pd | Rhat | ESS | Prior_Distribution | Prior_Location | Prior_Scale |
|:----------|-------:|---:|-------:|--------:|---:|-----:|----:|:-------------------|---------------:|------------:|
| (Intercept) | 19.406 | 0.950 | 17.316 | 21.372 | 1.000 | 1.001 | 4,749.509 | normal | 13.360 | 7.271 |
| a1c_old | −0.957 | 0.950 | −1.087 | −0.817 | 1.000 | 1.001 | 4,638.566 | normal | 0.000 | 4.028 |
| age | 0.025 | 0.950 | −0.002 | 0.052 | 0.962 | 1.000 | 5,188.851 | normal | 0.000 | 0.794 |
```r
model_parameters(fit2, ci = 0.95) |> gt() |> fmt_number(decimals = 3)
```

| Parameter | Coefficient | SE | CI | CI_low | CI_high | t | df_error | p |
|:----------|------------:|---:|---:|-------:|--------:|--:|---------:|--:|
| (Intercept) | 19.423 | 1.019 | 0.950 | 17.420 | 21.427 | 19.064 | 347.000 | 0.000 |
| a1c_old | −0.958 | 0.070 | 0.950 | −1.095 | −0.822 | −13.786 | 347.000 | 0.000 |
| age | 0.025 | 0.014 | 0.950 | −0.002 | 0.052 | 1.791 | 347.000 | 0.074 |
Estimating fit3 Coefficients?
```r
model_parameters(fit3B, ci = 0.95) |> gt() |> fmt_number(decimals = 3)
```

| Parameter | Median | CI | CI_low | CI_high | pd | Rhat | ESS | Prior_Distribution | Prior_Location | Prior_Scale |
|:----------|-------:|---:|-------:|--------:|---:|-----:|----:|:-------------------|---------------:|------------:|
| (Intercept) | 19.487 | 0.950 | 17.457 | 21.626 | 1.000 | 1.000 | 3,032.152 | normal | 13.360 | 7.271 |
| a1c_old | −0.953 | 0.950 | −1.088 | −0.815 | 1.000 | 1.000 | 4,061.029 | normal | 0.000 | 4.028 |
| age | 0.023 | 0.950 | −0.004 | 0.051 | 0.953 | 1.000 | 3,711.976 | normal | 0.000 | 0.794 |
| incomeBetween_30-50K | 0.058 | 0.950 | −0.580 | 0.697 | 0.566 | 1.001 | 2,710.381 | normal | 0.000 | 14.803 |
| incomeBelow_30K | −0.209 | 0.950 | −0.831 | 0.454 | 0.734 | 1.000 | 2,758.186 | normal | 0.000 | 15.100 |
```r
model_parameters(fit3, ci = 0.95) |> gt() |> fmt_number(decimals = 3)
```
fit1 includes a1c_old, fit2 adds age, fit3 adds income
Similar model assumption problems (hard-to-ignore problem with linearity, maybe some non-constant variance)
training sample: fit2 performed better than the others on adjusted \(R^2\), AIC and \(\sigma\), while fit1 was best on BIC, and fit3 was best on RMSE.
test sample: fit2 performed better on MAPE, RMSPE and validated \(R^2\), while fit1 had the smallest maximum APE.
Differences between models on most metrics were modest.
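Those test-sample summaries come from comparing back-transformed predictions to the observed A1c values: since transa1c = 100/a1c, a prediction on the A1c scale is 100/.fitted. A sketch for fit2:

```r
# Holdout predictions for fit2, back-transformed to the A1c scale.
aug2_test <- augment(fit2, newdata = dm500_i_test |> mutate(transa1c = 100/a1c))

aug2_test |>
  mutate(fit_a1c = 100 / .fitted,
         err = a1c - fit_a1c) |>
  summarize(MAPE = mean(abs(err)),           # mean absolute prediction error
            RMSPE = sqrt(mean(err^2)),       # root mean squared prediction error
            max_APE = max(abs(err)),         # maximum absolute prediction error
            valid_R2 = cor(a1c, fit_a1c)^2)  # validated R-squared
```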
Let’s Choose Model fit2
Complete Case Analysis
Project B Study 2: Report the model building process, displaying conclusions from the training sample (transformation decisions, fitted parameters with CIs and interpretations, comparison of performance metrics, checks of model assumptions) and from the test sample (comparison of predictive error).
Of course, here we did some imputation, so we’d need to do this all over again to get these results.
Simple Imputation
Project B Study 2: Report the model building process, displaying conclusions from the training sample (transformation decisions, fitted parameters with CIs and interpretations, comparison of performance metrics, checks of model assumptions) and from the test sample (comparison of predictive error).
Sometimes, we do some of the steps above without reporting them to the public, of course. But that’s not the goal here.
Model fit2 using Single Imputation
Estimated using a single imputation in training sample.
```r
n_obs(fit2)
```

```
[1] 350
```
```r
model_parameters(fit2, ci = 0.95) |>
  gt() |>
  fmt_number(decimals = 3) |>
  tab_options(table.font.size = 24)
```
Same as Project B Study 2: report the model building process, displaying conclusions from the training sample (transformation decisions, fitted parameters with CIs and interpretations, comparison of performance metrics, checks of model assumptions) and from the test sample (comparison of predictive error).
New: then add information on fitted parameters across the whole data set (not split into training and testing) after multiple (in this case, 10) imputations.
Using Ten Imputations
dm500_tenimps contained mice results across 10 imputations; we then used the 7th in our work. Let's build a new set of results for the model we've settled on, with a fresh set of 10 imputations.
The original data (with missing values) are in dm500; we need only add our transformed outcome, 100/a1c.
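A sketch of that workflow with mice, fitting our chosen model (fit2's predictors) in each of 10 completed data sets and pooling the estimates with Rubin's rules (the seed below is an arbitrary choice):

```r
# Add the transformed outcome to the original (incomplete) data.
dm500_plus <- dm500 |> mutate(transa1c = 100/a1c)

# A fresh set of 10 imputations.
set.seed(20241119)
imps10 <- mice(dm500_plus, m = 10, printFlag = FALSE)

# Fit transa1c ~ a1c_old + age in each completed data set, then pool.
fits10 <- with(imps10, lm(transa1c ~ a1c_old + age))
summary(pool(fits10), conf.int = TRUE)
```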